Tidyverse Basics

This interactive tutorial introduces the basic features and format of the Tidyverse packages.

In this tutorial, you can write and run code in your browser using WebR.

With WebR, you can run all of the code in a cell with one click. You can also run the code line-by-line or run selected code with the following keyboard shortcuts:

Run selected code: Cmd + Return (macOS) or Ctrl + Enter (Windows/Linux)

Run the entire code cell: Shift + Enter

Setup

Installing the tidyverse package installs ~25 packages, but loading it attaches only the following ones by default: dplyr, forcats, ggplot2, lubridate, purrr, readr, stringr, tibble, and tidyr. Tidyverse packages also tend to be verbose in warning you when functions with the same name exist in multiple packages.

Background

Tidyverse packages do a few things:

  • fix some of the annoying parts of using R, such as changing default options when importing data files and preventing large data frames from printing to the console
  • are focused on working with data frames (and their columns), rather than individual vectors
  • usually take a data frame as the first input to a function, and return a data frame as the output of a function, so that function calls can be more easily strung together in a sequence
  • share some common naming conventions for functions and arguments that have a goal of making code more readable
  • tend to be verbose, opinionated, and are actively working to provide more useful error messages

Tidyverse packages are particularly useful for:

  • data exploration
  • reshaping data sets
  • computing summary measures over groups
  • cleaning up different types of data
  • reading and writing data

Data

Let’s import the data we’ll be using. The data is from the Stanford Open Policing Project and includes a random subset of vehicle stops by the police in Evanston, Illinois, in 2017. We’re reading the data in directly from a URL.

We’re going to use the read.csv function. This will produce a data frame in our R environment. However, the tidyverse uses a different standard format for data tables, called the tibble. We can convert a data frame to a tibble using as_tibble.

In the code below, we are connecting the two functions using the pipe operator, |>. The pipe operator tells R to take the output of the command on the left and provide it as the first argument (input) to the function on the right. Using pipe operators can simplify writing code for complex sequences of commands.

In this case, since we included a pipe at the end of the first line, R knows the line isn’t finished and will continue onto the next line, running read.csv before storing the results in the object police.
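As a minimal sketch of that pattern, the example below reads CSV text from an inline string instead of the tutorial's URL and pipes the result into as_tibble(). The name police_demo and every value in csv_text are invented for illustration.

```r
library(tibble)

# Inline CSV text standing in for the real file (invented rows)
csv_text <- "date,beat,subject_race
2017-01-01,71,white
2017-01-02,72,black"

# The pipe at the end of the first line tells R the expression continues
police_demo <- read.csv(text = csv_text) |>
  as_tibble()

police_demo
```

The native |> operator requires R 4.1 or later.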

The output message that you get tells you what data type it guessed for each column based on the format of the information. “chr” is character or text data, “dbl” is numeric (it stands for double, which is a technical term for a type of number), and “lgl” is logical/boolean (TRUE/FALSE). Note that it also automatically read and identified date and time values and converted them to date and time objects, not just string/character data.

We can also manually specify column types for cases where the assumption that read.csv makes is wrong. We use the colClasses argument. Let’s read location as character data, since it contains zip codes, and zip codes should not be treated as numbers.
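A small sketch of how colClasses changes the result, again using invented inline CSV text rather than the real file. Passing a named vector overrides the guess only for the named column.

```r
# Inline CSV text standing in for the real file (invented rows)
csv_text <- "date,location,beat
2017-01-01,60201,71
2017-01-02,60202,72"

# Default: location looks numeric, so it is read as an integer column
default_types <- read.csv(text = csv_text)
class(default_types$location)   # "integer"

# A named colClasses vector forces location to be read as text
fixed_types <- read.csv(text = csv_text,
                        colClasses = c(location = "character"))
class(fixed_types$location)     # "character"
```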

EXERCISE 1

Remember: you need to have loaded tidyverse, so execute the cells above.

Imagine we have a dataset that includes ISO two-letter country codes. The country code for Namibia is NA, so we don’t want to read “NA” in as missing.

Look at the documentation (help) page for read.csv. You can open it by typing ?read.csv in the console. The na.strings argument determines what values are imported as missing NA.

Change the code below so that only empty strings “” and “N/A” values are imported as missing (not “NA”). Look at fix_data after importing so you can check the values.

Tibbles

You may have noticed above that we converted the imported data to something called a tibble. Tibbles are the tidyverse version of a data frame. You can use them as you would a data frame (they are one), but they behave in slightly different ways.

```{webr-r, eval=TRUE}
police
```

The most observable difference is that tibbles will only print 10 rows and the columns that will fit in your console. When they print, they print a list of column names and the types of the columns that are shown.

To preview each column, its type, and its first few entries, use glimpse():
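A minimal sketch of glimpse(), using police_demo, a small made-up tibble that stands in for police (all values are invented):

```r
library(dplyr)

# Made-up stand-in for the police data
police_demo <- tibble(
  subject_race = c("white", "black", "white"),
  subject_sex = c("male", "female", "female"),
  citation_issued = c(TRUE, FALSE, TRUE)
)

# Prints one line per column: its name, its type, and its first values
glimpse(police_demo)
```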

When you subset tibbles with [] notation, the result is always a tibble. In contrast, data frames sometimes return a data frame and sometimes return just a vector.
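The difference shows up when a single column is selected. The sketch below uses two invented objects, demo_df and demo_tbl, that hold the same made-up values:

```r
library(tibble)

# The same invented values as a data frame and as a tibble
demo_df <- data.frame(beat = c("71", "72"),
                      subject_sex = c("male", "female"))
demo_tbl <- as_tibble(demo_df)

# Selecting one column from a data frame drops down to a plain vector
is.data.frame(demo_df[, "beat"])    # FALSE

# The same selection on a tibble still returns a tibble
is.data.frame(demo_tbl[, "beat"])   # TRUE
```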

dplyr

dplyr is the core package of the tidyverse. It includes functions for working with tibbles (or any data frames). While you can still use base R operations on tibbles/data frames, such as using $ and [] subsetting like we did above, dplyr provides alternatives to all of the common data manipulation tasks.

Here, we’re just going to look at the basics of subsetting data to get a feel for how tidyverse functions typically work. Next session, we’ll get into variations on subsetting data and some other dplyr functions.

Before we start, let’s remember what columns are in our data:

select

The select() function lets us choose which columns (or variables) we want to keep in our data.

The data frame is the first input, and the name of the column is the second. We do not have to put quotes around the column name.
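A minimal sketch of selecting one column, using police_demo, a small made-up tibble that stands in for police (all values are invented):

```r
library(dplyr)

# Made-up stand-in for the police data
police_demo <- tibble(
  subject_race = c("white", "black", "white"),
  subject_sex = c("male", "female", "female"),
  outcome = c("citation", "warning", "citation")
)

# Data frame first, then the unquoted column name
one_column <- select(police_demo, outcome)
one_column
```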

If we want to select additional columns, we can just list the column names as additional inputs, each column name separated by commas:
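For instance, with the same kind of made-up police_demo tibble (invented values, not the real data):

```r
library(dplyr)

# Made-up stand-in for the police data
police_demo <- tibble(
  subject_race = c("white", "black", "white"),
  subject_sex = c("male", "female", "female"),
  outcome = c("citation", "warning", "citation")
)

# Each additional column name is another argument
two_columns <- select(police_demo, subject_race, subject_sex)
two_columns
```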

As with [] indexing, columns will be returned in the order specified:
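A short sketch of the ordering behavior, again on an invented police_demo tibble:

```r
library(dplyr)

# Made-up stand-in for the police data
police_demo <- tibble(
  subject_race = c("white", "black", "white"),
  subject_sex = c("male", "female", "female"),
  outcome = c("citation", "warning", "citation")
)

# outcome is listed first, so it becomes the first column of the result
reordered <- select(police_demo, outcome, subject_race)
names(reordered)   # "outcome" "subject_race"
```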

We could also use the column index number if we wanted to instead. We don’t need to put the values in c() like we would with [] (but we could).
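A sketch of selecting by position, on an invented police_demo tibble. Both calls below pick the first and third columns:

```r
library(dplyr)

# Made-up stand-in for the police data
police_demo <- tibble(
  subject_race = c("white", "black", "white"),
  subject_sex = c("male", "female", "female"),
  outcome = c("citation", "warning", "citation")
)

# Positions as separate arguments
by_position <- select(police_demo, 1, 3)

# The same positions wrapped in c()
by_vector <- select(police_demo, c(1, 3))

names(by_position)   # "subject_race" "outcome"
```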

Yes, there are other ways to specify which columns you want. We’ll cover those next session.

EXERCISE 2

Remember: you need to have loaded tidyverse, and the police data, so execute the cells above.

Convert this base R expression: police[,c("violation", "citation_issued", "warning_issued")] to use select() instead to do the same thing:

filter

To choose which rows should remain in our data, we use filter(). As with [], we write expressions that evaluate to TRUE or FALSE for each row. Like select(), we can use the column names without quotes.

Note that we use == to test for equality and get TRUE/FALSE output. You can also write more complicated expressions – anything that will evaluate to a vector of TRUE/FALSE values.
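A minimal sketch of both cases, a single equality test and a combined condition, using an invented police_demo tibble (the column names follow the police data, the values do not):

```r
library(dplyr)

# Made-up stand-in for the police data
police_demo <- tibble(
  subject_sex = c("male", "female", "female"),
  outcome = c("citation", "warning", "citation"),
  vehicle_year = c(2005, 2012, 2016)
)

# Keep rows where outcome equals "citation"
citations <- filter(police_demo, outcome == "citation")

# Combine conditions with & to build a more complicated expression
recent_citations <- filter(police_demo,
                           outcome == "citation" & vehicle_year > 2010)

citations
recent_citations
```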

Variables (columns) that are already logical (TRUE/FALSE values), can be used to filter:
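For example, with an invented police_demo tibble that has a logical citation_issued column:

```r
library(dplyr)

# Made-up stand-in for the police data
police_demo <- tibble(
  subject_sex = c("male", "female", "female"),
  citation_issued = c(TRUE, FALSE, TRUE)
)

# The logical column is itself the filter condition
with_citation <- filter(police_demo, citation_issued)

# Negate it with ! to keep the FALSE rows instead
without_citation <- filter(police_demo, !citation_issued)

with_citation
without_citation
```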

EXERCISE 3

Use filter() to choose the rows where subject_race is “white”.

The equivalent base R expression would be police[police$subject_race == "white",].

slice

Unlike select(), we can’t use row numbers to index which rows we want with filter. This gives an error:
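A sketch of that failure on an invented police_demo tibble. tryCatch() is used here only so the error message is captured and printed instead of stopping the script:

```r
library(dplyr)

# Made-up stand-in for the police data
police_demo <- tibble(
  subject_race = c("white", "black", "white"),
  subject_sex = c("male", "female", "female")
)

# filter() expects TRUE/FALSE values, so row numbers raise an error
result <- tryCatch(
  filter(police_demo, 1:3),
  error = function(e) conditionMessage(e)
)

result
```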

If we did need to use the row index (row number) to select which rows we want, we can use the slice() function.
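A minimal sketch of slice(), using an invented police_demo tibble:

```r
library(dplyr)

# Made-up stand-in for the police data
police_demo <- tibble(
  subject_race = c("white", "black", "white"),
  subject_sex = c("male", "female", "female")
)

# Rows 1 through 2
first_two <- slice(police_demo, 1:2)

# Rows 1 and 3
first_and_third <- slice(police_demo, c(1, 3))

first_two
first_and_third
```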

We don’t usually use slice() in this way when working with dplyr. This is because we ideally want to be working with well-structured data, where we can reorder the rows without losing information. If reordering the rows in the dataset would result in a loss of information (it would mess up your data), then the dataset is missing an important variable – maybe just a sequence index. You should always be able to use a variable to order the data if needed.

Pipe: Chaining Commands Together

So, we can choose rows and choose columns separately; how do we combine these operations? Commands from dplyr and other tidyverse packages can be strung together in a series with the |> or %>% (read: pipe) operator. If you are familiar with working in a terminal or at the command line, it works like the bash pipe character |. It takes the output of the command on the left and makes that the first input to the command on the right.

This works because the functions all take a data frame as the first input, and they return a data frame as the output.

We can rewrite

as

and you’ll often see code formatted so that %>% is at the end of each line, with the following lines that are still part of the same expression indented:
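The three forms just described can be sketched side by side. The example uses an invented police_demo tibble; all three expressions return the same result.

```r
library(dplyr)

# Made-up stand-in for the police data
police_demo <- tibble(
  subject_race = c("white", "black", "white"),
  subject_sex = c("male", "female", "female"),
  outcome = c("citation", "warning", "citation")
)

# Ordinary function call: the data frame is the first argument
nested <- select(police_demo, subject_race, subject_sex)

# Piped: the data frame on the left becomes the first argument
piped <- police_demo %>% select(subject_race, subject_sex)

# Piped, with the pipe ending the line and the next line indented
piped_multiline <- police_demo %>%
  select(subject_race, subject_sex)

identical(nested, piped)   # TRUE
```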

The %>% pipe comes from a package called magrittr, which has additional special operators in it that you can use; the |> pipe is built into R itself (version 4.1 and later). In RStudio, the keyboard shortcut for inserting a pipe is Command-Shift-M (Mac) or Control-Shift-M (Windows).

We can use the pipe to string together multiple commands operating on the same data frame:
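A sketch of such a chain, run on an invented police_demo tibble rather than the real police data:

```r
library(dplyr)

# Made-up stand-in for the police data
police_demo <- tibble(
  subject_race = c("white", "black", "white"),
  subject_sex = c("male", "female", "female"),
  outcome = c("citation", "warning", "citation")
)

# Choose two columns, then keep only the matching rows
chained <- police_demo %>%
  select(subject_race, subject_sex) %>%
  filter(subject_race == "white")

chained
```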

We would read the %>% in the command above as “then” if reading the code out loud: from police, select subject_race and subject_sex, then filter where subject_race is white.

This works because the dplyr functions take a tibble/data frame as the first argument (input) and return a tibble/data frame as the output. This makes it easy to pass a data frame through multiple operations, changing it one step at a time.

Order does matter, as the commands are executed in order. So this would give us an error:
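A sketch of the failing order on an invented police_demo tibble. tryCatch() is used only so the error message is captured and printed instead of stopping the script:

```r
library(dplyr)

# Made-up stand-in for the police data
police_demo <- tibble(
  subject_race = c("white", "black", "white"),
  subject_sex = c("male", "female", "female")
)

# select() drops subject_race, so the later filter() cannot find it
result <- tryCatch(
  police_demo %>%
    select(subject_sex) %>%
    filter(subject_race == "white"),
  error = function(e) conditionMessage(e)
)

result
```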

This fails because subject_race is no longer in the data frame by the time we try to filter with it. We’d have to reverse the order:
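A sketch of the working order on an invented police_demo tibble, filtering first while subject_race is still available:

```r
library(dplyr)

# Made-up stand-in for the police data
police_demo <- tibble(
  subject_race = c("white", "black", "white"),
  subject_sex = c("male", "female", "female")
)

# Filter on subject_race first, then drop it with select()
reversed <- police_demo %>%
  filter(subject_race == "white") %>%
  select(subject_sex)

reversed
```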

You can use the pipe operator to string together commands outside of the tidyverse as well, and it works with any input and output, not just data frames:
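Two small illustrations with plain vectors and base R functions only. They use the native |> pipe, which needs R 4.1 or later; the values are arbitrary.

```r
# A numeric vector: take square roots, then add them up
root_total <- c(1, 4, 9) |> sqrt() |> sum()
root_total      # 6

# A character vector: take the first three letters, then upper-case them
first_letters <- letters |> head(3) |> toupper()
first_letters   # "A" "B" "C"
```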

EXERCISE 4

Select the date, time, and outcome (columns) of stops that occur in beat “71” (rows). Make use of the %>% operator.

The equivalent base R expression would be: police[police$beat == "71", c("date", "time", "outcome")]

Hint: remember that a column needs to still be in the data frame if you’re going to use the column to filter.

Note that so far, we haven’t actually changed the police data frame at all. We’ve written expressions to give us output, but we haven’t saved it.

Sometimes we may still want to save the result of some expression, such as after performing a bunch of data cleaning steps. We can assign the output of piped commands as we would with any other expression.
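A minimal sketch of saving a piped result, using an invented police_demo tibble. The name citation_stops is made up for this example; the original tibble is left unchanged.

```r
library(dplyr)

# Made-up stand-in for the police data
police_demo <- tibble(
  subject_race = c("white", "black", "white"),
  subject_sex = c("male", "female", "female"),
  citation_issued = c(TRUE, FALSE, TRUE)
)

# The assignment captures the output of the whole pipeline
citation_stops <- police_demo %>%
  filter(citation_issued) %>%
  select(subject_race, subject_sex)

citation_stops

# police_demo itself still has all of its rows and columns
dim(police_demo)   # 3 3
```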

EXERCISE 5

Select only the vehicle_year and vehicle_make columns for observations where there were contraband_weapons.

Recap

We learned what tibbles are, the dplyr equivalents of indexing and subsetting a data frame, and the pipe %>% operator.

Next time we’re going to look at some more complicated use cases for select, filter, and slice, as well as learn mutate to create or change variables in our datasets.

Acknowledgements

This tutorial was adapted by Brennan Antone from the original tidyverse tutorial by Christina Maimone.